† Corresponding author. E-mail:
Project supported by the National Natural Science Foundation of China (Grant No. 31570722).
Secondary structures of RNAs are the basis for understanding their tertiary structures and functions, so their prediction is widely needed given the increasing discovery of noncoding RNAs. In the last decades, many methods have been proposed to predict RNA secondary structures, but their accuracy has reached a bottleneck. Here we present a method for RNA secondary structure prediction using direct coupling analysis and a remove-and-expand algorithm that shows better performance than four existing popular multiple-sequence methods. We further show that the results can also be used to improve the prediction accuracy of single-sequence methods.
Secondary structures of RNAs are widely used in studies of their tertiary structures and functions, and they have become increasingly important with the discovery of a growing number of noncoding RNAs in biological processes.[1–6] Many methods of RNA secondary structure prediction have been proposed in the last decades.[7,8] These methods can be divided into two categories: those based on a single sequence[9,10] and those based on multiple sequences.[7] For the single-sequence approach, the most widely used methods are based on the Nussinov algorithm[11] and the minimum free energy (MFE) model.[10,12–14] The accuracies of these methods are about 70% in both precision (PPV) and sensitivity (STY).[15,16] For the multiple-sequence approach, structure conservation between homologous sequences from the same RNA family is utilized to infer their common secondary structure.[7] The PPVs of these methods are usually higher than those based on free-energy minimization, about 70%–80%, but the STYs are lower and usually less than 70%.[7,17] Therefore, accurate prediction of RNA secondary structure remains a challenge.
The secondary structure of an RNA is a specific base-pairing pattern formed by the canonical and wobble base pairs (A-U, G-C, and G-U) in its native structure. For convenience, we refer to these three types of base pairs as the standard base pairs in the following. Pseudoknots are not considered in this work and are treated as tertiary interactions. The challenge of secondary structure prediction is to find this specific base-pairing pattern among a huge number of possible ones. The MFE model assumes that this specific base-pairing pattern is a minimum free energy state, but in practice it is difficult to obtain an accurate free energy for a base-pairing pattern. However, it was shown that the accuracy of the single-sequence MFE method could be increased by experimental constraints, such as SHAPE data.[18,19] On the other hand, for the multiple-sequence approach, the base-pairing pattern might be determined from coevolutionary base pairs (co-pairs for short) inferred from the homologous sequences of the target RNA.[20,21] At present, there are many methods for inferring the co-pairs, such as direct coupling analysis (DCA).[22–24] If the co-pairs inferred by these methods are used directly to predict RNA secondary structures, the performance is only comparable to that of the existing methods. However, it was shown that, when combined with the generalized Nussinov algorithm, DCA could predict RNA secondary structures with significant improvement in STY without reducing PPV in comparison with the mutual-information (MI) method, according to a test on six RNAs.[20]
Here we present a method that combines DCA with a remove-and-expand algorithm to predict RNA secondary structure and benchmark it on a large test set. We call our method 2dRNAdca, consistent with the name of our method for predicting the three-dimensional structure of RNA, 3dRNA.[21,25–27] Furthermore, we show that the results of 2dRNAdca can be used to improve the prediction accuracy of the single-sequence MFE method.
2dRNAdca is based on the co-pairs inferred from the homologous sequences of the target RNA by DCA.[22–24] There are other models to infer the co-pairs, e.g., using the MI of a multiple sequence alignment.[28] The shortcoming of MI is that the predicted co-pairs contain many residue pairs that are not in direct contact in the tertiary structure.[24,29] DCA was proposed to disentangle direct contacts from indirect ones.[24]
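To make the MI baseline concrete, the following is a minimal sketch of the mutual information between two alignment columns. It is an illustration only, assuming columns are given as equal-length strings, using plain frequency counts without sequence reweighting or pseudocounts; the function name `mutual_information` is hypothetical and is not taken from Ref. [28].

```python
from collections import Counter
from math import log

def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two MSA columns.

    col_i, col_j: equal-length strings, one character per aligned sequence.
    Frequencies are raw counts; no reweighting or pseudocounts (assumption).
    """
    n = len(col_i)
    f_i = Counter(col_i)                 # single-column frequencies
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # joint frequencies
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi

# Two columns that covary perfectly (G pairs with C, C pairs with G).
covarying = mutual_information("GGCC", "CCGG")
# Two columns whose letters are statistically independent.
independent = mutual_information("GCGC", "AAUU")
```

A high MI value only says that two columns covary; it does not distinguish a direct contact from a correlation transmitted through a third position, which is the limitation that DCA addresses.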
The basic principle of DCA is briefly presented here, following the detailed description in our previously published papers.[21,30] For a target RNA sequence, we can build a multiple sequence alignment (MSA) with its homologous sequences from the same RNA family. The MSA can be represented as
DCA calculates a score for each residue pair in the target RNA sequence. Usually, the pairs with the N largest scores are considered as co-pairs, e.g., with N equal to or less than the length of the target RNA. The DCA scores for the target RNA can be supplied by users or calculated by using the option in our 2dRNAdca web server. A multiple sequence alignment for the target RNA sequence is required for prediction of co-pairs using DCA and is generated from the Rfam database[31,32] in this work. There are other DCA algorithms, e.g., one using a global probability model through pseudo-likelihood maximization (plmDCA),[33,34] to infer the co-pairs.
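The selection of co-pairs from DCA scores can be sketched as follows. This is a minimal illustration, assuming the scores are available as a dictionary keyed by zero-based residue index pairs; the function name `top_n_copairs` and the toy scores are hypothetical and do not reflect the input format of the 2dRNAdca web server.

```python
def top_n_copairs(scores, n):
    """Return the n residue pairs with the largest DCA scores.

    scores: dict mapping (i, j) index pairs with i < j to a DCA score.
    The result is ordered from the highest score to the lowest.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:n]]

# Hypothetical scores for a toy sequence; real scores come from DCA.
toy_scores = {(0, 5): 0.9, (1, 4): 0.7, (0, 4): 0.2, (2, 5): 0.1}
copairs = top_n_copairs(toy_scores, 2)
```

In this toy case the two highest-scoring pairs, (0, 5) and (1, 4), are retained as co-pairs.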
The co-pairs inferred by DCA usually include many non-standard base pairs, one-to-many base pairs, pseudoknotted base pairs, and single base pairs (helices with only one base pair). Since non-standard base pairs and pseudoknotted base pairs are not considered in this work, and one-to-many base pairs and single base pairs rarely occur in an RNA secondary structure,[35] they can be considered as false positives. 2dRNAdca adopts a “remove-and-expand” algorithm to remove these false positives from the co-pairs inferred by DCA. It includes two steps:
Removing step. Take the top N standard (A-U, G-C, and G-U) co-pairs from the co-pairs inferred by DCA and remove the single base pairs and one-to-many base pairs among them. However, if a single base pair can be expanded (see the following expanding step), it is retained. Similarly, if some of the base pairs in a one-to-many group can be expanded, the one that can form the longest stem is retained.
Expanding step. The co-pairs remaining after the removing step are considered as predicted base pairs of the target RNA and can be taken as the “cores” of the native secondary structure. Standard base pairs are then added between predicted unpaired bases wherever they stack continuously on the predicted base pairs. During the expanding process, hairpin loops are kept to at least three bases. If one base can form base pairs with more than one other base, the algorithm chooses the pair that can form the longest stem. After the expanding process, the pseudoknotted base pairs are removed. The final structure is considered as the predicted secondary structure of the target sequence.
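The two steps can be illustrated with the simplified sketch below. It is not the 2dRNAdca implementation: all function names are hypothetical, stem lengths are estimated from the sequence alone, conflicts during expansion are resolved on a first-come basis rather than by the longest stem, and pseudoknots are removed greedily in index order, which are assumptions made to keep the example short.

```python
STANDARD = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def is_standard(seq, i, j):
    return (seq[i], seq[j]) in STANDARD

def stem_length(seq, i, j, min_loop=3):
    """Number of stacked standard pairs in the longest helix through (i, j)."""
    if j - i - 1 < min_loop or not is_standard(seq, i, j):
        return 0
    length = 1
    a, b = i - 1, j + 1                       # grow outward
    while a >= 0 and b < len(seq) and is_standard(seq, a, b):
        length, a, b = length + 1, a - 1, b + 1
    a, b = i + 1, j - 1                       # grow inward, keeping the loop size
    while b - a - 1 >= min_loop and is_standard(seq, a, b):
        length, a, b = length + 1, a + 1, b - 1
    return length

def remove_step(seq, copairs, min_loop=3):
    """Drop non-standard, one-to-many, and non-expandable single pairs."""
    length = {p: stem_length(seq, *p, min_loop) for p in copairs}
    candidates = [p for p in copairs if length[p] > 0]
    best = {}                                 # base index -> pair with longest stem
    for p in candidates:
        for base in p:
            if base not in best or length[p] > length[best[base]]:
                best[base] = p
    kept = {p for p in candidates if best[p[0]] == p and best[p[1]] == p}
    def stacked(p):
        return (p[0] - 1, p[1] + 1) in kept or (p[0] + 1, p[1] - 1) in kept
    return {p for p in kept if stacked(p) or length[p] > 1}

def expand(seq, pairs, min_loop=3):
    """Add standard pairs that stack directly on already predicted pairs."""
    pairs = set(pairs)
    paired = {k for p in pairs for k in p}
    changed = True
    while changed:
        changed = False
        for i, j in sorted(pairs):            # sorted() copies, so adding is safe
            for a, b in ((i - 1, j + 1), (i + 1, j - 1)):
                if a < 0 or b >= len(seq) or b - a - 1 < min_loop:
                    continue
                if a in paired or b in paired or not is_standard(seq, a, b):
                    continue
                pairs.add((a, b))
                paired.update((a, b))
                changed = True
    return pairs

def remove_pseudoknots(pairs):
    """Greedy removal (assumption): keep a pair only if it crosses no kept pair."""
    accepted = []
    for i, j in sorted(pairs):
        if all(not (k < i < l < j or i < k < j < l) for k, l in accepted):
            accepted.append((i, j))
    return set(accepted)

def remove_and_expand(seq, copairs, min_loop=3):
    cores = remove_step(seq, copairs, min_loop)
    return remove_pseudoknots(expand(seq, cores, min_loop))

# Toy hairpin: (1, 8) is a true core, (1, 7) a one-to-many conflict, (3, 6) non-standard.
toy_seq = "GGGAAAACCC"
predicted = remove_and_expand(toy_seq, [(1, 8), (1, 7), (3, 6)])
```

On this toy hairpin, the removing step keeps only the core pair (1, 8), and the expanding step grows it into the three-pair stem (0, 9), (1, 8), (2, 7) while leaving a four-base hairpin loop.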
In Fig.
Two test sets are used in this work. The test set I consists of six RNAs (PDB IDs: 1Y26, 2GDI, 2GIS, 3IRW, 3VRS, 3OWI) (Table
Table 1. Performance of mfDCA and 2dRNAdca on the test set I.
Table 2. Parameters of the test set II of 94 RNAs.
We use precision (PPV), sensitivity (STY), and the Matthews correlation coefficient (MCC)[36,37] to measure the performance of the methods in predicting RNA secondary structures, as is usual.[7,17] PPV measures the fraction of predicted base pairs that are native, STY measures the fraction of native base pairs that are predicted, and MCC is a balanced measure of PPV and STY. They are defined as follows:
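As an illustration of these measures, the sketch below computes them from sets of predicted and native base pairs. PPV and STY follow the standard definitions in terms of true positives, false positives, and false negatives. For MCC, the sketch assumes the geometric-mean approximation, the square root of PPV times STY, that is commonly used for RNA base-pair evaluation; whether this exact form is the one used in this work is an assumption. The function name `evaluate` is hypothetical.

```python
from math import sqrt

def evaluate(predicted, native):
    """Return (PPV, STY, MCC) for predicted versus native base pairs.

    MCC is approximated as sqrt(PPV * STY) (assumption, see lead-in).
    """
    predicted, native = set(predicted), set(native)
    tp = len(predicted & native)      # predicted pairs that are native
    fp = len(predicted - native)      # predicted pairs that are not native
    fn = len(native - predicted)      # native pairs that were missed
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sty = tp / (tp + fn) if tp + fn else 0.0
    return ppv, sty, sqrt(ppv * sty)

# Toy example: two of three predicted pairs are native; two native pairs are missed.
native_pairs = {(0, 9), (1, 8), (2, 7), (3, 6)}
predicted_pairs = {(0, 9), (1, 8), (2, 6)}
ppv, sty, mcc = evaluate(predicted_pairs, native_pairs)
```

Here PPV is 2/3 and STY is 1/2, so a method that predicts few but reliable pairs is rewarded in PPV and penalized in STY, and the MCC value summarizes both.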
Table
Fig. 2. The performances of 2dRNAdca, DCA*, and mfDCA for six RNAs. DCA* is the method used in Ref. [20] that combined mfDCA with a generalized Nussinov algorithm. The mean values of PPV, STY, and MCC of DCA* for the six RNAs were calculated according to Fig.
The 2dRNAdca is further benchmarked on the test set II (Table
We first study how the performances of mfDCA and 2dRNAdca change with the top N standard co-pairs (Fig.
Fig. 3. The performance of (a) mfDCA and (b) 2dRNAdca vs. the top N standard co-pairs with N = 0.1L, 0.2L, …, L over the test set II. L is the length of the RNA.
Table 3. Performance of different methods of RNA secondary structure prediction over the test set II.
For comparison, Table
Fig. 4. The performances of 2dRNAdca, CentroidAlifold, MXScarna, RNAalifold, TurboFold, and mfDCA on the test set II.
Table
Table 4. Performance of 2dRNAdca for different types of RNAs in the test set II.
Finally, we show that the results of 2dRNAdca can be used to improve the accuracy of single-sequence MFE methods. Here, we use the predicted base pairs of 2dRNAdca as the constraints for the MFE methods, including Mfold,[10] RNAfold,[13,44] and RNAstructure.[12] The predictions can be done by using the default parameters on the 2dRNAdca web server and the results (2dRNAdca + Mfold, 2dRNAdca + RNAfold, 2dRNAdca + RNAstructure) are also presented in Table
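One way to hand predicted base pairs to an MFE program is as a dot-bracket constraint string. The sketch below only builds such a string; it assumes zero-based, non-crossing pairs with i < j, and the function name `pairs_to_dot_bracket` is hypothetical. RNAfold from the ViennaRNA package, for example, reads a constraint line of this kind when run with its `-C` option, where matching brackets require a pair and dots leave a base unconstrained. How the 2dRNAdca web server passes constraints to Mfold, RNAfold, or RNAstructure is not described here, so this is an illustration rather than the server's procedure.

```python
def pairs_to_dot_bracket(length, pairs):
    """Encode non-crossing base pairs (i < j, zero-based) as a dot-bracket string.

    Unpaired or unconstrained positions are written as '.'.
    """
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i], symbols[j] = "(", ")"
    return "".join(symbols)

# Toy example: a three-pair stem closing a four-base hairpin loop.
constraint = pairs_to_dot_bracket(10, {(0, 9), (1, 8), (2, 7)})
```

For the toy input the resulting string is `(((....)))`, which would be placed on the line following the sequence in the input to the folding program.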
It is known that the performance of multiple-sequence methods usually depends on the number of available homologous sequences. However, figure
Fig. 5. The scatter plot of performances of (a) 2dRNAdca and (b) mfDCA with the number of homologous sequences in the test set II.
Figure
Fig. 6. The scatter plot of performances of (a) 2dRNAdca and (b) mfDCA with the sequence length for the test set II.
The performance of 2dRNAdca depends not only on the accuracy of mfDCA in inferring co-pairs, i.e., the number of true positives (native base pairs) in the top 0.3L co-pairs, but also on whether these true positives are distributed over all the stems (Fig.
In summary, we propose a method, 2dRNAdca, for predicting RNA secondary structure by combining DCA with a remove-and-expand algorithm. The benchmark results show that 2dRNAdca significantly improves on the performance of mfDCA and also performs better than existing popular multiple-sequence methods. Furthermore, we show that the results of 2dRNAdca can be used to improve the prediction accuracy of the single-sequence method. It is expected that the performance of 2dRNAdca will improve as the accuracy of DCA improves.
[1] | |
[2] | |
[3] | |
[4] | |
[5] | |
[6] | |
[7] | |
[8] | |
[9] | |
[10] | |
[11] | |
[12] | |
[13] | |
[14] | |
[15] | |
[16] | |
[17] | |
[18] | |
[19] | |
[20] | |
[21] | |
[22] | |
[23] | |
[24] | |
[25] | |
[26] | |
[27] | |
[28] | |
[29] | |
[30] | |
[31] | |
[32] | |
[33] | |
[34] | |
[35] | |
[36] | |
[37] | |
[38] | |
[39] | |
[40] | |
[41] | |
[42] | |
[43] | |
[44] |